Group 3
December 14, 2017
The objectives were to summarize and visualize how the dependent variables are associated with the different independent variables, and to develop models that analyze these associations in the data. Our work included a linear model, a regression tree, a LASSO model, and a random forest.
If we can discover specific characteristics of the drugs that are associated with effectiveness against TB, this will help the researchers understand varying drug mechanisms in the body using a mouse model.
The dependent variables tested were drug efficacy in the lung (ELU) and in the spleen (ESP). The independent variables included several in vivo (mouse model) and in vitro tests that were performed by the TB research group.
Image retrieved from: https://www.pinterest.com/pin/233624299389735946/
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)  # provides theme_few()

linear_model <- function(peak_trough, dep_var,
                         data = efficacy_summary) {
  # Keep one concentration level and reshape so that each row holds
  # one independent variable measurement
  function_data <- data %>%
    filter(level == peak_trough) %>%
    gather(key = independent_var, value = indep_measure,
           -drug, -dosage, -dose_int, -level, -ELU, -ESP,
           na.rm = TRUE) %>%
    select(drug, dosage, dose_int, level, all_of(dep_var),
           indep_measure, independent_var)

  # Copy the chosen outcome ("ELU" or "ESP") into a common column
  function_data$vect <- function_data[[dep_var]]

  # Regress the outcome on one standardized independent variable
  model_function <- function(data) {
    lm(vect ~ scale(indep_measure), data = data)
  }

  # Fit one model per independent variable and dose interval
  estimate_results <- function_data %>%
    group_by(independent_var, dose_int) %>%
    nest() %>%
    mutate(mod_results = purrr::map(data, model_function)) %>%
    mutate(mod_coefs = purrr::map(mod_results, broom::tidy)) %>%
    select(independent_var, dose_int, mod_results, mod_coefs) %>%
    unnest(mod_coefs) %>%
    filter(term == "scale(indep_measure)")

  # Plot the coefficients; point size grows as the standard error shrinks
  coef_plot <- estimate_results %>%
    ungroup() %>%
    mutate(independent_var = forcats::fct_reorder(
      independent_var, estimate, .fun = max)) %>%
    rename(Dose_Interval = dose_int) %>%
    ggplot(aes(x = estimate, y = independent_var,
               color = Dose_Interval)) +
    geom_point(aes(size = 1 / std.error)) +
    scale_size_continuous(guide = "none") +
    theme_few() +
    ggtitle(label = paste("Linear model coefficients as function",
                          "of independent variables,\nby drug dose",
                          "and model uncertainty"),
            subtitle = "Smaller points have more uncertainty than larger points") +
    geom_vline(xintercept = 0, color = "cornflower blue")
  coef_plot
}
Function: linear_model
Default data: efficacy_summary
dep_var options: “ELU” (lung efficacy) or “ESP” (spleen efficacy)
peak_trough options: “Cmax” or “Trough”
# Sample code for function, linear_model (Cmax and ELU)
linear_model(peak_trough = "Cmax", dep_var = "ELU")
# Sample code for function, linear_model (Cmax and ESP)
linear_model(peak_trough = "Cmax", dep_var = "ESP")
Coefficients far to the right or far to the left of zero indicate the strongest associations between the independent and dependent variables.
If the coefficient is negative, as it is for MacUptake in the ELU linear regression model, the interpretation is that for every one-standard-deviation increase in MacUptake (each independent variable is scaled before fitting), ELU decreases by 0.5 units. MacUptake therefore has a negative relationship with ELU. The size of each point is proportional to 1 / standard error, so larger points mark coefficients estimated with more certainty. These estimates may change as more data are collected for each drug.
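To make the "per standard deviation" reading concrete, here is a minimal base R sketch on simulated data (the variables `x` and `y` are stand-ins, not values from the TB dataset). It shows that a coefficient fitted on `scale(x)` equals the raw slope multiplied by `sd(x)`.

```r
set.seed(1)
# Simulated stand-ins for one independent variable and one outcome
x <- rnorm(40, mean = 10, sd = 4)
y <- 2 - 0.125 * x + rnorm(40, sd = 0.3)

raw_fit    <- lm(y ~ x)         # slope per one raw unit of x
scaled_fit <- lm(y ~ scale(x))  # slope per one standard deviation of x

raw_slope    <- unname(coef(raw_fit)[2])
scaled_slope <- unname(coef(scaled_fit)[2])

# The scaled slope equals the raw slope multiplied by sd(x)
all.equal(scaled_slope, raw_slope * sd(x))
```

Because every independent variable is put on the same scale, the coefficients in the plot can be compared with each other even when the raw measurements use different units.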
# Tree-fitting call used by regression_tree(), shown here for ELU
rpart(ELU ~ drug + dosage + level +
        plasma + `Uninvolved lung` + `Rim (of Lesion)` +
        `Outer Caseum` + `Inner Caseum` +
        `Standard Lung` + `Standard Lesion` + cLogP +
        `Human Plasma Binding` +
        `Mouse Plasma Binding` + `MIC Erdman Strain` +
        `MIC Erdman Strain with Serum` +
        `MIC rv strain` + `Caseum binding` +
        `Macrophage Uptake (Ratio)`,
      data = function_data,
      # A negative cp means no split is rejected for improving the fit too little
      control = rpart.control(cp = -1,
                              minsplit = min_split,
                              minbucket = min_bucket))
dep_var options: “ELU” (lung efficacy) or “ESP” (spleen efficacy)
min_split: numeric input indicating the minimum number of observations for a split to be attempted
min_bucket: numeric input indicating the minimum number of observations in a terminal node
data = efficacy_summary (default; must use this to run properly)
regression_tree(dep_var = "ELU", min_split = 8,
                min_bucket = 4)
Each terminal node contains at least 4 observations (min_bucket = 4).
A split is attempted only at nodes with at least 8 observations (min_split = 8).
The tree keeps splitting until the min_split or the min_bucket parameters are fulfilled for each node.
Background: We want to predict our outcome using the variables we have in front of us. LASSO is the next generation of step-wise regression and can handle more variables than samples.
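library(dplyr)
library(glmnet)

LASSO_model <- function(dep_var, dose, df = efficacy_summary) {
  # Drop incomplete rows, keep numeric columns, and keep one dosage
  data <- na.omit(df) %>%
    select_if(is.numeric) %>%
    filter(dosage == dose)
  # Build the response and predictors from the filtered data so that
  # the dose filter and NA removal take effect
  response <- data %>%
    select(all_of(dep_var))
  predictors <- data %>%
    select(c("PLA", "ULU", "RIM", "OCS", "ICS", "SLU", "SLE", "cLogP",
             "huPPB", "muPPB", "MIC_Erdman", "MICserumErd",
             "MIC_Rv", "Caseum_binding", "MacUptake"))
  y <- as.numeric(unlist(response))
  x <- as.matrix(predictors)
  # glmnet defaults to alpha = 1, which is the LASSO penalty
  fit <- glmnet(x, y)
  # Extract the coefficients at penalty value lambda = 0.1
  coeff <- coef(fit, s = 0.1)
  coeff <- as.data.frame(as.matrix(coeff))
  coeff
}
LASSO_model(dep_var = "ELU", dose = 50)

| predictor | coeff |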
|---|---|
| (Intercept) | 1.2911027 |
| cLogP | 0.2908215 |
| muPPB | 0.0049209 |
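A short list of predictors like this is consistent with how the LASSO penalty behaves: it can shrink coefficients all the way to zero. The base R sketch below illustrates the soft-thresholding step behind that behavior in the simplest case (orthonormal predictors, and ignoring how a given package scales the penalty). The function name `soft_threshold` and the input coefficients are hypothetical and are not taken from the TB data or from glmnet.

```r
# Soft-thresholding: shrink each value toward zero by lambda and set
# anything smaller than lambda in magnitude to exactly zero
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Hypothetical unpenalized coefficients (not values from the TB data)
ols_coefs <- c(a = -0.30, b = 0.05, c = 0.40)
shrunk <- soft_threshold(ols_coefs, lambda = 0.1)
shrunk
```

Here the small coefficient `b` is removed entirely, while `a` and `c` are kept but pulled toward zero.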
Goal: Determine which variables are the most important for predicting either lung efficacy (ELU) or spleen efficacy (ESP).
Output: Interactive plotly graph displaying levels of importance.
dep_var: The dependent variable of interest, either “ELU” or “ESP”. If the user does not choose a dependent variable, the function defaults to “ELU”.
drug: Whether to include (TRUE) or exclude (FALSE) the variable drug, which identifies the drug being tested. The function defaults to FALSE, which excludes drug.
df: The data frame to use for the function. The default data frame is efficacy_summary; however, the user can specify another data frame.
efficacy.rf <- randomForest(ELU ~ ., data = dataset,
                            na.action = na.roughfix,
                            ntree = 500,
                            importance = TRUE)
ELU
best_variables("ELU", drug = FALSE)
ESP
best_variables("ESP", drug = TRUE)
The larger the increase in mean squared error when a variable is permuted, the more important the variable.
MIC Rv Strain, Mouse Plasma Binding, and Caseum Binding are the most important
In vitro variables seem to be better predictors than in vivo variables
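The idea behind this importance measure can be sketched in base R: shuffle one predictor at a time and see how much the prediction error grows. This is only an illustration under simplifying assumptions. It uses simulated data and a plain `lm` fit evaluated on the same rows, whereas randomForest computes the measure from out-of-bag samples for each tree.

```r
set.seed(42)
n <- 200
# Simulated predictors: x1 drives the outcome, x2 barely matters
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 * sim$x1 + 0.1 * sim$x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2, data = sim)

# Mean squared error of a model's predictions on a data frame
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}
base_mse <- mse(fit, sim)

# Increase in MSE after shuffling one predictor at a time
mse_increase <- sapply(c("x1", "x2"), function(v) {
  shuffled <- sim
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(fit, shuffled) - base_mse
})
mse_increase
```

Shuffling `x1` breaks its link to the outcome and raises the error sharply, while shuffling `x2` changes little, so `x1` would be ranked as the more important variable.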
These functions may produce errors if:
Input datasets have a very low or very high number of observations.
Missing data are recorded differently. We noticed in the individual drug data that the “NA”, or missing data, for spleen_efficacy had a space before the NA. This type of variation in how missing data are recorded could cause problems for the functions.
Drug names or codes change.
New independent variables or measurements are added to the data frame.
The dose and frequency combinations change. The dataset provided included only two combinations, 50 BID and 100 QD.
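The missing-data issue in particular (an “NA” recorded with a leading space) could be guarded against before the data reach the modeling functions. The sketch below is one possible approach in base R; the helper `clean_numeric` and the example values are hypothetical.

```r
# Hypothetical raw column in which one missing value was typed as " NA"
raw_spleen <- c("1.2", " NA", "NA", "0.8 ")

clean_numeric <- function(x) {
  x <- trimws(x)      # remove leading and trailing spaces
  x[x == "NA"] <- NA  # treat the text "NA" as missing
  as.numeric(x)
}

spleen_efficacy <- clean_numeric(raw_spleen)
sum(is.na(spleen_efficacy))
```

When the data are read from a file, the `na.strings` and `strip.white` arguments of `read.csv` may also be worth checking, since they control how such entries are interpreted at import time.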
The next step is to include more data representing more drugs.
Validating the models (for random forest and LASSO) would help us understand how well they predict drug efficacy. The data can be split into training and test subsets for this purpose.
These predictive models could be combined in a single function that outputs all of the coefficients for comparison.
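As a rough illustration of the validation step, the base R sketch below holds out part of a dataset, fits a model on the rest, and measures error on the held-out rows. It uses simulated data and a simple `lm` fit as stand-ins; the same split could in principle be applied to the random forest and LASSO models.

```r
set.seed(7)
n <- 100
# Simulated stand-in data; not the TB efficacy data
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)

# Hold out 30 rows for testing and train on the remaining 70
test_idx <- sample(n, size = 30)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on rows the model never saw
test_rmse <- sqrt(mean((test$y - pred)^2))
test_rmse
```

With a dataset as small as the current one, a single split may be unstable, so repeating the split or using cross-validation might give a more reliable estimate.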